Abstract: The K-Nearest Neighbor Attractor-based Neural
Network  and  the  Optimal  Linear  Discriminatory  Filter  Classifier 
are  feature-based  classifiers  that  are  trained  via  supervised 
learning  using  a  training  set  of  feature  vectors.  They  were 
developed  by  the  author  and  successfully  used  in  several 
applications  where  they  were  "and-ed."  Results  using  these 
classifiers  have  been  published,  but  comprehensive  descriptions  of 
them  have  not  appeared  in  the  literature.  This  paper  presents 
their  detailed  descriptions. 

I.  INTRODUCTION 

The  K-Nearest  Neighbor  Attractor-based  Neural  Network 
(KNNANN)  is  a  probabilistic-type  of  neural  network.  It  was 
developed  based  on  concepts  taken  from  the  Probabilistic 
Neural  Network  (PNN  [1]),  Reduced  Coulomb  Energy  neural 
network  (RCE  [2]),  and  the  K-Nearest  Neighbor  classifier 
(KNN [2]). It estimates the joint posteriori probability,
Prob(class | feature vector), from training data. Specifically,
the  KNNANN  training  process  produces  a  joint  posteriori 
histogram  of  the  training  set  under  the  constraint  of  equally 
likely  classes  even  if  the  training  set  is  unbalanced  (i.e.,  each 
class  has  a  different  number  of  training  samples.) 

The  Optimal  Linear  Discriminatory  Filter  Classifier 
(ODFC)  is  a  classifier  that  uses  a  bank  of  linear  classifiers  that 
are  referred  to  as  linear  discriminatory  filters.  The  coefficients 
of  the  linear  classifiers  are  found  by  solving  a  generalized 
eigenvalue problem. There are generic similarities between the
ODFC optimization approach and the one described in [3].
The  outputs  of  the  linear  classifiers  are  combined  to  produce  an 
overall  statistic  that  is  used  to  determine  class  membership. 

These  two  classifiers  are  complementary  in  how  they 
construct  decision  boundaries  in  feature  space.  The  KNNANN 
uses  multiple  hyperspheres  (also  referred  to  as  attractors)  to 
define  class  boundaries,  and  the  ODFC  uses  multiple 
hyperplanes.  As  such  they  have  been  combined  by  the  author 
using  various  "anding"  schemes,  which  resulted  in  overall 
performance  that  was  much  better  than  the  performance  of 
either  one  alone. 

A  desirable  attribute  of  both  classifiers  is  that  their  training 
processes  are  not  iterative  and  are  computationally  very  fast. 
This  permits  the  designer  to  efficiently  evaluate  many  feature 
subsets from a large set of candidate features and select the
subset  that  performs  "best"  or  "good  enough,"  thereby 
overcoming  Bellman's  well-known  curse  of  dimensionality  [4]. 

These  classifiers  were  developed  by  the  author  several 
years  ago  and  used  in  several  applications  [5-8].  In  section  III 
the  KNNANN  is  described  in  detail,  and  in  section  IV  the 
ODFC  is  described. 


II.  FEATURE  VECTOR  NORMALIZATION 

Both  classifiers  perform  better  if  feature  vectors  are 
normalized.  The  normalization  process  is  described  in  this 
section.  Define 

x(k) = k-th feature vector from the training set    (1)

class(k) = class of x(k) $\in$ {1, 2, ..., Nc}

where

Nc = number of classes

k = 1, 2, ..., N

N = number of training feature vectors

M(i) = number of training vectors belonging to class i

Let $x_j(k)$ denote the j-th feature component of the k-th
feature vector. The normalized feature component is given by

$x_j(k; \text{normalized}) = [\, x_j(k; \text{original}) - \mathrm{bias}(j) \,] / \mathrm{scale}(j)$    (2)

where

w(k) = sample weights    (3)
     = 1 / M(class(k)) for k = 1, 2, ..., N

$W = \sum_{k=1}^{N} w(k)$    (4)

$\mathrm{bias}(j) = (1/W) \sum_{k=1}^{N} w(k)\, x_j(k)$    (5)

$\mathrm{scale}(j) = (1/W) \sum_{k=1}^{N} w(k)\, |\, x_j(k) - \mathrm{bias}(j) \,|$    (6)



Regarding this normalization process: (1) scale and bias
are computed using only the training data and must be saved
for application to any new feature vector; (2) the use of w(k)
imposes an equally likely class probability assumption, which
prevents any class with a much greater number of samples than
the other classes from dominating the calculations. For the
remainder of the paper, the feature vectors x(k) are assumed to
be normalized.
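
The normalization of Eqs 2 to 6 can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation; the function names fit_normalization and normalize and the toy data are assumptions made for the example, and no guard is included for a feature whose scale is zero.

```python
def fit_normalization(X, labels):
    """Return (bias, scale) per feature from class-balanced weights (Eqs 3-6)."""
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1          # M(i)
    w = [1.0 / counts[c] for c in labels]         # Eq 3
    W = sum(w)                                    # Eq 4
    n = len(X[0])
    bias = [sum(wk * x[j] for wk, x in zip(w, X)) / W
            for j in range(n)]                    # Eq 5
    scale = [sum(wk * abs(x[j] - bias[j]) for wk, x in zip(w, X)) / W
             for j in range(n)]                   # Eq 6
    return bias, scale


def normalize(x, bias, scale):
    """Apply Eq 2 to one feature vector using the saved bias and scale."""
    return [(xj - b) / s for xj, b, s in zip(x, bias, scale)]


# Unbalanced toy set (hypothetical): three samples of class 1, one of class 2.
X = [[0.0], [1.0], [2.0], [9.0]]
labels = [1, 1, 1, 2]
bias, scale = fit_normalization(X, labels)
x_new = normalize([9.0], bias, scale)
```

In this toy set the single class 2 sample carries as much total weight as the three class 1 samples together, so the bias lands midway between the two class means rather than near the larger class.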

As an aside, in some applications it is
beneficial to base the determination of the bias and scale on
samples  from  one  key  class  (or  a  few  key  classes)  rather  than 
all  classes  as  presented  here. 
III.  KNNANN

The first stage of the KNNANN Training algorithm
determines the number of attractors (hyperspheres) that will
contain all the feature vectors from the training set, their center
locations in feature space, and their associated radii. Let

c(L) = center of attractor L    (7)

r(L) = radius of attractor L

The second stage of the KNNANN Training algorithm
determines p(i, L), the estimated probability that an object
belongs to class i given that its feature vector is contained in
attractor L.

The KNNANN Training algorithm uses some selectable
design parameters: $\alpha$, $\beta$, and $\gamma(i)$:

$\alpha$ = nearest neighbor factor    (8)

    where $0 < \alpha < 1$

    Recommend: $\alpha = 0.5$

$\beta$ = radius dither parameter    (9)

    where $0 < \beta < 0.5$

    Recommend: $\beta = 0.005$

$\gamma(i)$ = removal factor for i = 1, 2, ..., Nc    (10)

    where $0 \le \gamma(i) \le 1$

    Recommend: $\gamma(i)$ = 0.0 if M(i) is small
                          = 0.5 if M(i) is of modest size
                          = 1.0 if M(i) is large

The  selection  and  usage  of  these  parameters  will  be 
discussed  later,  but  first  the  training  algorithm  is  presented. 

The KNNANN uses a vector norm, which will be denoted
as norm(u) for the norm of vector u. The standard L2 norm is
recommended.  Other  norms  can  be  used  to  derive  different 
variants  of  the  KNNANN. 

KNNANN Training Algorithm

1.  Initialization:
    Set L = 0
        kref = 0
        flag(k) = 0 for k = 1, 2, ..., N

2.  Set kref = kref + 1
    If kref > N
        Then: Go to step 12
        Else: If flag(kref) = 0
                  Then: Go to step 3
                  Else: Go to step 2

3.  Set L = L + 1
        c(L) = center of attractor
             = x(kref)

4.  Set d(k) = norm( x(k) - c(L) ) and
        $v(k) = [M(\mathrm{class}(k))]^{-1/2}$ for k = 1, 2, ..., N.

5.  Sort d(k) from smallest to largest. Let $k_m$ be the sorted
    index list; i.e., $d(k_1) \le d(k_2) \le \ldots \le d(k_N)$.

6.  Find the smallest M such that
        $\sum_{m=1}^{M} v(k_m) \ge \alpha$

7.  Set r(L) = radius of the attractor
             = $(1 + \beta)\, d(k_M)$

8.  Set Q = 0
        q(i) = 0 for i = 1, 2, ..., Nc.

9.  For m = 1, 2, ..., N
        Set k = $k_m$
        If $d(k) \le r(L)$,
            Then: w = 1 / M(class(k))
                  Q = Q + w
                  q(class(k)) = q(class(k)) + w
                  If $d(k) \le \gamma(\mathrm{class}(k))\, r(L)$
                      Then: flag(k) = 1
                      Else: continue
            Else: Go to step 10.
    End of For m loop

10. For i = 1, 2, ..., Nc
        p(i, L) = q(i) / Q
    End of For i loop

11. Return to step 2

12. Training is complete.
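
The twelve steps above can be sketched compactly in Python. This is an illustrative reading of the algorithm, not the author's code: the function name train_knnann, the tuple layout of an attractor, the default removal factor of 0.5 for every class, and the toy data are assumptions. The go-to structure of steps 2 and 11 is expressed as a loop over candidate centers, and the standard L2 norm is used.

```python
import math


def train_knnann(X, labels, alpha=0.5, beta=0.005, gamma=None):
    """Return a list of (center, radius, probs) tuples, one per attractor."""
    classes = sorted(set(labels))
    M = {c: labels.count(c) for c in classes}         # M(i): samples per class
    gamma = gamma or {c: 0.5 for c in classes}        # removal factors (assumed default)
    N = len(X)
    flag = [False] * N                                # step 1
    attractors = []
    for kref in range(N):                             # step 2
        if flag[kref]:
            continue
        center = X[kref]                              # step 3
        d = [math.dist(x, center) for x in X]         # step 4, L2 norm
        order = sorted(range(N), key=lambda k: d[k])  # step 5
        total = 0.0
        for k in order:                               # step 6
            total += 1.0 / math.sqrt(M[labels[k]])
            if total >= alpha:
                break
        radius = (1.0 + beta) * d[k]                  # step 7
        Q, q = 0.0, {c: 0.0 for c in classes}         # step 8
        for k in order:                               # step 9
            if d[k] > radius:
                break
            w = 1.0 / M[labels[k]]
            Q += w
            q[labels[k]] += w
            if d[k] <= gamma[labels[k]] * radius:
                flag[k] = True                        # cannot become a center later
        probs = {c: q[c] / Q for c in classes}        # step 10
        attractors.append((center, radius, probs))
    return attractors


# Two well-separated one-dimensional clusters (hypothetical data).
X_demo = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
labels_demo = [1, 1, 1, 1, 2, 2, 2, 2]
attractors = train_knnann(X_demo, labels_demo, alpha=0.9)
```

Because the two clusters are far apart, every attractor found in this example contains samples of a single class only, so its estimated probability for that class is one.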

When  training  is  completed,  L  attractors  have  been 
identified  with  radii  r(j),  centers  c(j),  and  estimated  class 
probabilities p(i,j)  for  i  =  1,  2,  ...  Nc  and  j  =  1,  2,  ...  L.  p(i,j) 
is an estimate of the posteriori probability,
Prob(class i | x = c(j)), under the assumption of equally likely
class probabilities. As desired, note that

$\sum_{i=1}^{N_c} p(i,j) = 1$    (11)


The  following  algorithm  describes  how  this  information  is 
used  to  determine  the  class  of  an  arbitrary  feature  vector,  x. 
The  evaluation  algorithm  uses  a  single  selectable  design 
parameter,  h. 

h  =  attractor  radius  expansion  factor  (12) 

where 
0 < h

Recommend: h = 1.5

The selection and usage of parameter h will be discussed
later, but first the evaluation algorithm is presented.

KNNANN Evaluation algorithm

1.  Initialize: Q = 0,
                q(i) = 0 for i = 1, 2, ..., Nc

2.  For j = 1, 2, ..., L
        d = norm( x - c(j) )
        If $d \le h\, r(j)$
            Then: w = 1 / (d + 0.000001)
                  Q = Q + w
                  For i = 1, 2, ..., Nc
                      q(i) = q(i) + w p(i,j)
                  End of For i loop
            Else: continue
    End of For j loop

3.  If Q > 0
        Then: For i = 1, 2, ..., Nc
                  z(i | x) = q(i) / Q
              End of For i loop
        Else: For i = 1, 2, ..., Nc
                  z(i | x) = 0
              End of For i loop

If Q > 0 upon completion of these three steps, z(i | x) is
an estimate of Prob(class i | x). Because each p(i,j) sums to one
over i (Eq 11), the sum of z(i | x) over i is also one. If Q = 0,
then x did not belong to the neighborhood of any attractor. In
this case, it is assumed that x is statistically dissimilar to the
training set. Therefore there is no basis for assigning a class;
the class is declared "unknown."
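
The evaluation steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function name evaluate_knnann, the (center, radius, probs) layout of an attractor, and the two hand-made attractors are hypothetical and chosen only to make the interpolation visible.

```python
import math


def evaluate_knnann(x, attractors, classes, h=1.5, eps=1e-6):
    """Return z(i | x) for each class; all zeros means the class is unknown."""
    Q = 0.0
    q = {c: 0.0 for c in classes}                 # step 1
    for center, radius, probs in attractors:      # step 2
        d = math.dist(x, center)
        if d <= h * radius:
            w = 1.0 / (d + eps)                   # inverse-distance weight
            Q += w
            for c in classes:
                q[c] += w * probs[c]
    if Q > 0:                                     # step 3
        return {c: q[c] / Q for c in classes}
    return {c: 0.0 for c in classes}


# Two hypothetical attractors with different class probabilities.
demo_attractors = [
    ([0.0, 0.0], 1.0, {1: 0.9, 2: 0.1}),
    ([1.0, 0.0], 1.0, {1: 0.2, 2: 0.8}),
]
z_mid = evaluate_knnann([0.5, 0.0], demo_attractors, classes=[1, 2])
z_far = evaluate_knnann([10.0, 10.0], demo_attractors, classes=[1, 2])
```

The point midway between the two centers receives the plain average of the two attractors' probabilities, while the distant point falls outside every expanded radius and is returned with all-zero scores.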

Let

$I_{max} = \arg\max_{i}\, z(i \mid x)$    (13)

$I_{ref} = \arg\max_{i \ne I_{max}}\, z(i \mid x)$

T = a designer selectable classification threshold.

The classification rule is given by:

If $z(I_{max} \mid x) > T\, z(I_{ref} \mid x)$, then class = $I_{max}$; otherwise the
class is considered uncertain.

For a two-class problem (e.g., class 1 = target and
class 2 = nontarget), the classification rule is given by:

If z(1 | x) > T z(2 | x), then class = 1; otherwise class = 2.

T can be selected to adjust the operating point for
probability of classification versus probability of false alarm.
Furthermore, if a sufficiently large set of feature vectors is
available with their associated class labels, the Receiver
Operating Characteristic (ROC) curve can be estimated by varying T.
(So as not to be fooled by possible overtraining, one should use
feature vectors that were not used in the training to generate
the ROC curve.)
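
The threshold rule and the ROC estimate can be sketched as below. The function names classify and roc_points, the held-out scores, and the threshold values are assumptions made for illustration; the scores stand in for z(1 | x) and z(2 | x) computed on feature vectors that were not used in training.

```python
def classify(z, T=1.0):
    """Multi-class rule: return the winning class, or None when uncertain."""
    ranked = sorted(z, key=z.get, reverse=True)
    i_max, i_ref = ranked[0], ranked[1]
    return i_max if z[i_max] > T * z[i_ref] else None


def roc_points(scores, is_target, thresholds):
    """Return (T, P_false_alarm, P_classification) for each threshold T."""
    n_target = sum(is_target)
    n_nontarget = len(is_target) - n_target
    points = []
    for T in thresholds:
        called = [z1 > T * z2 for z1, z2 in scores]   # two-class rule
        pc = sum(c and t for c, t in zip(called, is_target)) / n_target
        pfa = sum(c and not t for c, t in zip(called, is_target)) / n_nontarget
        points.append((T, pfa, pc))
    return points


# Hypothetical held-out scores (z1, z2) with their true labels.
scores = [(0.9, 0.1), (0.6, 0.4), (0.4, 0.6), (0.2, 0.8)]
is_target = [True, True, False, False]
roc = roc_points(scores, is_target, thresholds=[0.5, 1.0, 2.0])
```

Raising T lowers both the false alarm rate and the classification rate, which is the operating-point tradeoff described above.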

Other variants of interpolation can be used in the KNNANN
Evaluation algorithm by simply changing the interpolation
weight, w, in step 2. For example, one could use

$w = \exp(-a\, |\, d / r(j) \,|^{b})$    (14)

for appropriate choices of parameters a > 0 and b > 0.
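
For comparison, the two interpolation weights can be written as interchangeable functions. The names and the default values a = 1 and b = 2 are assumptions for illustration only, and the exponential form presumes a nonzero radius.

```python
import math


def inverse_distance_weight(d, r, eps=1e-6):
    """Weight used in step 2 of the evaluation algorithm; r is unused here."""
    return 1.0 / (d + eps)


def exponential_weight(d, r, a=1.0, b=2.0):
    """Alternative weight of Eq 14; assumes r > 0, a > 0 and b > 0."""
    return math.exp(-a * abs(d / r) ** b)


w_center = exponential_weight(0.0, 1.0)
w_edge = exponential_weight(1.0, 1.0)
```

Unlike the inverse-distance weight, the exponential weight stays bounded at the attractor center and depends on the distance relative to the attractor radius.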


The  KNNANN  training  algorithm  estimates  a  joint 
histogram  of  the  training  data  where  each  attractor  represents  a 
volume  in  feature  space  in  which  one  counts  the  number  of 
training  samples  that  belong  to  each  class.  The  KNNANN 
counting  process,  which  is  used  to  size  the  attractor  and 
determine  the  sample  probabilities  associated  with  it,  performs 
two  key  functions:  (1)  it  balances  the  training  data  when  the 
classes  may  have  a  disparate  number  of  samples,  and  (2)  it 
sizes  each  attractor  so  that  the  number  of  samples  it  contains  is 
compatible  with  the  size  of  the  training  set.  To  illustrate, 
suppose one is creating a histogram from M samples where the
histogram volumes (attractors) that cover feature space can
vary in size. Assume each volume contains, on the average,
$M_s$ samples and that $M_v$ volumes contain all M samples. If the
volumes are disjoint, then $M = M_s M_v$. The question is how
many volumes $M_v$ should one strive to have and how many
samples $M_s$ per volume. We would like $M_s$ to be large so that
the sample probabilities for each volume can be accurately
estimated. However, we would also like $M_v$ to be large so that
the joint histogram has many small attractors that have
sufficient resolution to accurately approximate the continuous
joint probability density over all of feature space. A natural
way to balance $M_v$ versus $M_s$ is by maximizing

$J(M_s, M_v) = \min(\alpha^2 M_v,\; M_s)$    (15)

under the constraint that $M_v M_s = M$, where parameter $\alpha$ is a
tradeoff weight. At the maximum the two arguments are equal,
$\alpha^2 M_v = M_s$, so the solution is given by

$M_s = \alpha M^{1/2}$    (16)

and

$M_v = (1/\alpha) M^{1/2}$

(We are ignoring the fact that the solution is, in general, not an
exact integer because it is not critical to this discussion.)

Note that the tradeoff weight, $\alpha$, is the "nearest neighbor
factor" that was defined earlier in Eq 8.

Furthermore note that $M_s(M) = \alpha M^{1/2}$ has the asymptotic
properties that

$\lim_{M \to \infty} M_s(M) = \infty$    (17)

$\lim_{M \to \infty} M_s(M) / M = 0$

These properties imply that the joint histogram will
approach the true joint probability density as M approaches
infinity.
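
A short numerical check makes the two asymptotic properties concrete. The function names and the sample sizes are assumptions chosen for illustration; the formulas are those of Eq 16 with the recommended nearest neighbor factor of 0.5.

```python
import math


def samples_per_volume(M, alpha=0.5):
    """M_s from Eq 16."""
    return alpha * math.sqrt(M)


def number_of_volumes(M, alpha=0.5):
    """M_v from Eq 16."""
    return math.sqrt(M) / alpha


table = []
for M in (100, 10_000, 1_000_000):
    Ms = samples_per_volume(M)
    Mv = number_of_volumes(M)
    table.append((M, Ms, Mv, Ms / M))
    print(f"M={M:>9}  Ms={Ms:7.1f}  Mv={Mv:7.1f}  Ms/M={Ms / M:.6f}")
```

As M grows, the number of samples per volume increases without bound while the fraction of the training set held by any one volume shrinks toward zero.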

However, we must also keep in mind that the training set
may not be balanced (i.e., some classes may have more
training samples than others). We do not want the training
process to favor one class over the others simply because it has
more samples. Accordingly, each sample is weighted by

$[M(\mathrm{class}(k))]^{-1/2}$.    (18)

Then the radius of an attractor is given as the smallest
radius such that the sum of the weights of the samples
contained in the attractor is greater than or equal to parameter
$\alpha$. This is implemented in steps 4, 5, and 6 of the KNNANN
Training algorithm. For the special case where the attractor
contains only class i, every sample has weight $[M(i)]^{-1/2}$, so
the number of samples in the attractor is the smallest integer
that is greater than or equal to

$\alpha\, [M(i)]^{1/2}$.    (19)
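
The single-class case of Eq 19 can be checked by direct accumulation of the weights of Eq 18. The function name samples_to_reach and the class sizes are assumptions for illustration.

```python
import math


def samples_to_reach(alpha, M_i):
    """Smallest count whose summed weights M_i**(-1/2) reach alpha."""
    v = 1.0 / math.sqrt(M_i)          # weight of each sample, Eq 18
    total, count = 0.0, 0
    while total < alpha:
        total += v
        count += 1
    return count


alpha = 0.5
sizes = (4, 16, 64, 10, 1000)         # hypothetical class sizes M(i)
counts = {M_i: samples_to_reach(alpha, M_i) for M_i in sizes}
```

For each class size the accumulated count agrees with the smallest integer that is at least alpha times the square root of M(i).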

To determine the sample probabilities within an attractor
and account for unbalanced training data, each sample is
weighted by

1 / M(i).    (20)

By using these weights in a counting procedure to
determine sample probabilities, all classes appear to have the
same number of training samples (i.e., classes appear equally
likely). This is implemented in steps 8, 9, and 10 of the
KNNANN Training algorithm. To illustrate, consider the
special case where the number of samples from class i that
belong to attractor L is Pq M(i) for all i, where Pq is any
fractional constant. Then p(i, L) will equal 1/Nc for all i.
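
The special case can be verified numerically. In this sketch the function name attractor_probabilities, the class sizes, and the choice of one tenth of each class falling inside the attractor are assumptions for illustration.

```python
def attractor_probabilities(counts_in_attractor, class_sizes):
    """Class probabilities of one attractor using the 1 / M(i) weights (Eq 20)."""
    q = {c: n / class_sizes[c] for c, n in counts_in_attractor.items()}
    Q = sum(q.values())
    return {c: q[c] / Q for c in q}


class_sizes = {1: 1000, 2: 100, 3: 10}        # strongly unbalanced training set
in_attractor = {1: 100, 2: 10, 3: 1}          # one tenth of every class
p = attractor_probabilities(in_attractor, class_sizes)
raw_fraction_class_1 = in_attractor[1] / sum(in_attractor.values())
```

With the weights applied, each class receives probability 1/3, whereas an unweighted count would assign roughly 0.9 to class 1 simply because it has more samples.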

The parameter $\gamma(i)$ is used to control the amount of
attractor overlap. It also influences the number of attractors
that will be determined. See the second "if-statement" in step 9
of the KNNANN Training algorithm. Note that any feature
vector that has not been flagged in step 9 has the
potential to be selected as an attractor center. If a feature
vector belonging to class i is within radius $\gamma(i)\, r(L)$ of the
center of attractor L, it is flagged so it cannot be selected as a
new attractor center in subsequent passes through the
algorithm. Consequently, if $\gamma(i) = 0$, then every feature vector
belonging to class i will be selected as an attractor center. For
$\gamma(i) = 0$, observe that attractors whose centers have been
selected from class i have the potential for large overlap
because a new attractor center is permitted to be very near an
existing attractor center. In contrast, for $\gamma(i) = 1$, observe that
attractors whose centers have been selected from class i have
a more limited potential to overlap because a new attractor
center must be at least a full radius away from all existing
attractor centers.

To reduce the number of attractors that must be stored and
searched during evaluation, it is recommended to use
$0.5 \le \gamma(i) \le 1$ when M(i) is large. To make maximum use of
class i data when M(i) is small, it is recommended that $\gamma(i) = 0$.

Note that when $\gamma(i) = 1$ for all i, a minimal number of
attractors is generated; and when $\gamma(i) = 0$ for all i, the largest
number of attractors is generated. In practice, $\gamma(i)$ is selected
by trial and error. Typically, for networks trained with 1000 to
4000 feature vectors, $\gamma(i)$ is selected so that the network will have
between 250 and 500 attractors. Note that training time
decreases substantially as $\gamma(i)$ becomes large because more
vectors are flagged for removal as attractors are selected; this is
especially useful for large training sets. It is instructive to note
that the number of passes through the training algorithm (i.e.,
the number of times steps 3-11 are executed) is finite and
cannot exceed N. More specifically, if the $\gamma(i)$ are near 0, the
number of passes is of order N; if the $\gamma(i)$ are near 1, the number of
passes is of order $N^{1/2}$.

The  training  algorithm  can  be  made  faster  by  noting  that 
only  a  partial  sort  is  required  in  step  5. 
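
One way to realize the partial sort is with a binary heap, popping only as many distances as are needed to reach the nearest neighbor factor. This is a sketch of one possible approach, not the author's method; the function name radius_by_partial_sort and the example values are assumptions. Step 9 can then scan the samples in any order and test each distance against the radius.

```python
import heapq


def radius_by_partial_sort(d, v, alpha, beta=0.005):
    """Steps 5 to 7 without a full sort; assumes d is non-empty."""
    heap = [(dk, k) for k, dk in enumerate(d)]
    heapq.heapify(heap)                    # linear-time heap construction
    total = 0.0
    while heap:
        dk, k = heapq.heappop(heap)        # next-smallest distance
        total += v[k]
        if total >= alpha:
            break
    return (1.0 + beta) * dk


d_demo = [0.9, 0.0, 0.4, 0.2, 0.7]         # distances to the attractor center
v_demo = [0.25] * 5                        # weights for a class of 16 samples
r_demo = radius_by_partial_sort(d_demo, v_demo, alpha=0.5)
```

Only two of the five distances are popped in this example, since the two nearest samples already reach the factor of 0.5.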

Because  the  attractors  are  likely  to  overlap,  an  arbitrary 
feature  vector  is  often  contained  in  more  than  one  attractor. 
This  is  beneficial  because  it  allows  one  to  obtain  a  better 
estimate  of  the  class  probabilities  by  interpolating  the  sample 
probabilities  from  each  attractor.  This  interpolation  is  done  in 
the  KNNANN  Evaluation  algorithm.  It  is  a  weighted  average 
of  the  attractors'  sample  probabilities  where  the  weights  favor 
the  attractors  whose  centers  are  nearest  the  feature  vector  that 
is  being  classified. 

The design parameter h, the radius expansion factor that is
used in the KNNANN Evaluation algorithm (Eq 12), controls
which attractors to associate with the feature vector that is
being classified. If the feature vector is not contained in any
attractor, it is considered statistically dissimilar to the
training set and, consequently, the class is declared
unknown. However, because the design of the network was
based on a finite training set, it is statistically possible for a
new feature vector not to be contained in any attractor but near
the perimeter of some of them. The expansion factor is used to
increase an attractor's radius of influence so that it will capture
these nearby feature vectors. In addition it permits more
attractors to be involved in the interpolation process.

IV.  ODFC 

The ODFC uses the same notation from Eqs 1 and 2.
However, the feature vector is augmented with an additional
component, which is set to unity. Namely,

$x_{n+1}(k) = 1$    (21)

As will be obvious later, the linear coefficient that
multiplies this component acts as an offset bias and allows the
hyperplane (as observed in the original n-dimensional feature
space) to optimally shift and better align with class boundaries.
For the remaining ODFC sections, x is assumed to be the
augmented (n+1)-dimensional feature vector.

Define a bank of coefficient vectors as

c(i,j) = the j-th coefficient vector associated    (22)
         with class i
       = (n+1)-dimensional column vector

where

i = 1, 2, ..., Nc and j = 1, 2, ..., Nf(i)    (23)

Nf(i) = number of coefficient vectors associated with
        class i

The first n components of c(i,j) define the normal vector
of a hyperplane in the original n-dimensional feature space.
The (n+1)-th component of c(i,j) accounts for the shift of the
hyperplane relative to the origin of the n-dimensional feature
space.

Define the output of the filters as the projection of x along
c(i,j); namely,

s(x, c(i,j)) = c'(i,j) x    (24)

where ' is the matrix-vector transpose operator.

The training goal is to determine c(i,j) such that, on the
average, $s^2(x(k), c(i,j))$ is large for x(k) belonging to class i
and is small for x(k) not belonging to class i for j = 1, 2, ...,
Nf(i).

Define the correlation matrix

$A(i) = (1/M(i)) \sum_{x(k) \in \mathrm{class}\ i} x(k)\, x'(k)$    (25)

so that, for any coefficient vector c,

$(1/M(i)) \sum_{x(k) \in \mathrm{class}\ i} s^2(x(k), c) = c' A(i)\, c$.    (26)


For clarity, the indices on c(i,j) have been dropped. The
solution is found by maximizing J(c) with respect to c where

$J(c) = \frac{c' A(i)\, c}{c' B(i)\, c}$    (27)

and

$B(i) = \frac{1}{N_c - 1} \sum_{m \ne i} A(m)$    (28)


One notes that the numerator of J(c) is the sample
expectation of $s^2(x, c)$ over the training set given that x belongs
to class i, and the denominator of J(c) is the sample
expectation of $s^2(x, c)$ over the training set given that x does
not belong to class i under the assumption of equally likely
classes. That is,

$J(c) = \frac{E_s[\, s^2(x, c) \mid x \in \mathrm{class}\ i \,]}{E_s[\, s^2(x, c) \mid x \notin \mathrm{class}\ i \,]}$    (29)

where $E_s[\,\cdot\,]$ denotes the sample expectation (or sample mean).

One recognizes J(c) as the Rayleigh quotient, implying the
solution is found by solving the generalized eigenvalue
problem:

$A(i)\, v - \lambda\, B(i)\, v = 0$    (30)

If the eigenvector v and eigenvalue $\lambda$ are a solution to Eq 30,
then

$\lambda = \frac{v' A(i)\, v}{v' B(i)\, v} = \frac{E_s[\, s^2(x, v) \mid x \in \mathrm{class}\ i \,]}{E_s[\, s^2(x, v) \mid x \notin \mathrm{class}\ i \,]}$    (31)

This implies that c(i,1), c(i,2), ..., c(i,Nf(i)) should be selected
to be all the eigenvectors whose eigenvalues are large relative
to 1.

If c(i,j) is scaled as follows,

$c(i,j) = v / (v' B(i)\, v)^{1/2}$    (32)

it is easy to show that

$E_s[\, s^2(x, c(i,j)) \,] = \lambda$ for x $\in$ class i    (33)

$E_s[\, s^2(x, c(i,j)) \,] = 1$ for x $\notin$ class i

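
The two identities of Eq 33 follow in one line each from Eqs 26, 28, and 31. The short derivation below is a sketch that assumes v is a generalized eigenvector of Eq 30 with eigenvalue $\lambda$ and that c is the scaled vector of Eq 32.

```latex
% Assumes v solves Eq 30 with eigenvalue \lambda and c = v / (v' B(i) v)^{1/2}.
\begin{align*}
E_s[\, s^2(x,c) \mid x \in \text{class } i \,]
  &= c' A(i)\, c = \frac{v' A(i)\, v}{v' B(i)\, v} = \lambda, \\
E_s[\, s^2(x,c) \mid x \notin \text{class } i \,]
  &= c' B(i)\, c = \frac{v' B(i)\, v}{v' B(i)\, v} = 1 .
\end{align*}
```

The scaling therefore fixes the average squared output for the other classes at one, so the eigenvalue directly measures how strongly the filter responds to class i.
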
Selection of eigenvectors for c(i,j)

The procedure for determining which eigenvectors to select
uses two selectable design parameters, H and K. All
eigenvectors are selected whose eigenvalues are greater than
H. However, if more than K eigenvectors have been selected,
only the eigenvectors corresponding to the K largest
eigenvalues are kept. Because A(i) and B(i) are (n+1) x (n+1)
matrices, the maximum number of eigenvectors is n+1.
Consequently Nf(i) obeys $0 \le N_f(i) \le K \le n+1$. In our
applications the choices of H = 2.0 and K = min(5, n+1) have
performed well. If no eigenvalues are greater than H (i.e.,
Nf(i) = 0), then one selects the eigenvector corresponding to
the maximum eigenvalue (so Nf(i) will be at least 1). When the
eigenvalues are not large, the implication is that the features
are not providing good discrimination and classifier
performance is generally poor.

Finally,  the  selected  eigenvectors  are  scaled  as  described 
above  in  Eq  32  and  stored  in  c(i,j). 

The  ODFC  training  process,  described  by  Eqs  25  to  32,  is 
repeated  for  i  =  1,  2, ...  Nc. 

Conditioning matrix A(i)

If the number of class samples is sufficiently large and the
features are linearly independent, then A(i) and B(i) are real
symmetric positive definite matrices. Fast and accurate
computer routines for solving generalized eigenvalue
problems having such matrices are readily available in many
linear algebra software packages (e.g., IMSL and MATLAB).
To increase robustness to random errors on the feature vector
and mitigate the possibility of ill-conditioning when solving
the generalized eigenvalue problem, the first n diagonal
elements of A(i) are multiplied by the factor $(1 + \varepsilon)$ prior to
generating B(i). One should try several values for $\varepsilon$ and choose
the one that gives the best results with a test set of feature
vectors. In many of our applications the choice of $\varepsilon = 0.0025$
has performed well.

Summary of the ODFC Training Process

The ODFC Training process is summarized as follows.
Initially augment each training feature vector with one
additional component and set it equal to 1.

For i = 1, 2, ..., Nc

1. Compute A(i) from Eq 25.

2. As described above in subsection "Conditioning matrix
A(i)," condition the diagonal elements of A(i).

3. Calculate B(i) by Eq 28.

4. Solve for the eigenvectors and eigenvalues of Eq 30.

5. As described above in subsection "Selection of
eigenvectors for c(i,j)," choose parameters H and K and
use them to select the appropriate subset of Nf(i)
eigenvectors.

6. According to Eq 32, scale the selected eigenvectors and
save them in c(i,j) for j = 1, 2, ..., Nf(i).

End of For i loop
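
The training summary can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the author's implementation: the function name train_odfc, the returned dictionary layout, and the toy data are hypothetical. The generalized eigenvalue problem of Eq 30 is reduced to an ordinary symmetric one through a Cholesky factor of B(i), which is one standard technique among several. All conditioned A(i) are computed before the loop over classes because B(i) needs the matrices of the other classes.

```python
import numpy as np


def train_odfc(X, labels, H=2.0, K=5, eps=0.0025):
    """Return {class: array whose rows are the scaled filters c(i,j)}."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[1]
    Xa = np.hstack([X, np.ones((len(X), 1))])        # augment with unity (Eq 21)
    classes = np.unique(labels)
    idx = np.arange(n)
    A = {}
    for i in classes:
        Xi = Xa[labels == i]
        Ai = Xi.T @ Xi / len(Xi)                     # Eq 25
        Ai[idx, idx] *= 1.0 + eps                    # condition first n diagonal elements
        A[i] = Ai
    filters = {}
    for i in classes:
        B = sum(A[m] for m in classes if m != i) / (len(classes) - 1)  # Eq 28
        Lc = np.linalg.cholesky(B)                   # B = Lc Lc'
        Y = np.linalg.solve(Lc, A[i])                # Lc^-1 A
        C = np.linalg.solve(Lc, Y.T)                 # Lc^-1 A Lc^-T
        lam, U = np.linalg.eigh((C + C.T) / 2.0)     # eigenvalues of Eq 30
        V = np.linalg.solve(Lc.T, U)                 # back-transform: v = Lc^-T u
        order = np.argsort(lam)[::-1]                # largest eigenvalue first
        keep = [j for j in order if lam[j] > H][:min(K, n + 1)] or [order[0]]
        filters[i] = np.array(
            [V[:, j] / np.sqrt(V[:, j] @ B @ V[:, j]) for j in keep])  # Eq 32
    return filters


# Hypothetical two-class data: class 1 spreads along x, class 2 along y.
X_demo = [[3, 0], [-3, 0], [4, 0.1], [-4, -0.1], [5, 0.2], [-5, -0.2],
          [0.1, 1], [-0.1, -1], [0.2, 0.5], [-0.2, -0.5], [0, 0.8], [0, -0.7]]
y_demo = [1] * 6 + [2] * 6
filters = train_odfc(X_demo, y_demo)
```

The sketch assumes at least two classes and that each B(i) is positive definite; if the Cholesky factorization fails, that is a sign that the features are linearly dependent or that a class has too few samples.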

ODFC Evaluation process

Using c(i,j) for i = 1, 2, ..., Nc and j = 1, 2, ..., Nf(i), the
class of a new feature vector x is determined as follows.
Define

$z(i \mid x) = \max_{j}\, |\, c'(i,j)\, x \,|$    (34)

$I_{max} = \arg\max_{i}\, z(i \mid x)$

$I_{ref} = \arg\max_{i \ne I_{max}}\, z(i \mid x)$

T = a designer selectable classification threshold.

The classification rule is given by:

If $z(I_{max} \mid x) > T\, z(I_{ref} \mid x)$, then class = $I_{max}$; otherwise the
class is considered uncertain.

For a two-class problem (e.g., class 1 = target and
class 2 = nontarget), a classification threshold, T, can be
introduced to adjust the probability of classification relative to
the probability of false alarm. The classification rule is given
by:

If z(1 | x) > T z(2 | x), then class = 1; otherwise class = 2.
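
The evaluation rule of Eq 34 can be sketched in plain Python. The function names odfc_scores and odfc_classify and the small filter bank are hypothetical; the input vector is assumed to be already normalized and is augmented with a unity component inside the function.

```python
def odfc_scores(x, filters):
    """Return z(i | x): the largest absolute filter output for each class."""
    xa = list(x) + [1.0]                          # augmented feature vector
    return {
        i: max(abs(sum(cj * xj for cj, xj in zip(c, xa))) for c in bank)
        for i, bank in filters.items()
    }


def odfc_classify(x, filters, T=1.0):
    """Return the winning class, or None when the class is uncertain."""
    z = odfc_scores(x, filters)
    ranked = sorted(z, key=z.get, reverse=True)
    i_max, i_ref = ranked[0], ranked[1]
    return i_max if z[i_max] > T * z[i_ref] else None


# Hypothetical filter bank for two features: one filter for class 1, two for class 2.
demo_filters = {
    1: [[1.0, 0.0, 0.0]],
    2: [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5]],
}
z_demo = odfc_scores([3.0, 1.0], demo_filters)
```

For this input the class 1 score is 3 and the class 2 score is 1, so the vector is assigned to class 1 for any threshold below 3 and is declared uncertain otherwise.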

V.  CONCLUSION 

This  paper  presented  detailed  descriptions  of  the  K-Nearest 
Neighbor  Attractor-based  Neural  Network  (KNNANN)  and  the 
Optimal  Linear  Discriminatory  Filter  Classifier.  Both  are 
feature-based  classifiers  whose  training  is  based  on  supervised 
learning. 

The  KNNANN  is  a  probabilistic-based  neural  network  that 
constructs  a  joint  histogram,  which  approximates  the  joint 
posteriori  class  probability  density  function.  Some  novel 
attributes  of  the  KNNANN  are: 

(1)  A  training  procedure  that  balances  the  size  of  the 
attractors  so  that  they  are  small  enough  to  resolve  the 
underlying  probability  density  but  large  enough  to  hold  a 
sufficient  number  of  training  samples  to  accurately  estimate  the 
class  probabilities. 

(2)  A  compensation  method  for  unbalanced  training  data 
sets (i.e., data sets where the number of samples in each class
differs substantially).


(3)  An  interpolation  method  that  estimates  class 
probabilities  by  combining  sample  probabilities  of  several 
overlapping  attractors. 

The  ODFC  employs  multiple  linear  classifiers  that  are 
combined  to  determine  the  class  of  a  feature  vector. 
Coefficients  of  the  linear  filters  are  derived  by  solving  a 
generalized  eigenvalue  problem. 

Because  the  KNNANN  uses  hyperspheres  to  define  class 
boundaries  in  feature  space  and  the  ODFC  uses  hyperplanes, 
the  two  classifiers  are  complementary  and  have  performed  well 
when  "anded." 

The training algorithms for both classifiers are not iterative
and are computationally very fast. This allows one to use the
classifiers  to  evaluate  a  large  number  of  feature  subsets  taken 
from  a  large  set  of  candidate  features  and  select  the  subset  that 
performs  the  "best"  or  "good  enough,"  thereby  overcoming 
Bellman's  well-known  curse  of  dimensionality. 